Rational metacognition in memory

The introduction below is mostly the same as the previous versions. New sections are marked NEW.

Summary

Memory recall is often modeled as a process of evidence accumulation. Existing models typically assume that this accumulation process is passive, and not subject to top-down control. In contrast, recent work in perceptual and value-based decision making has suggested that similar kinds of evidence accumulation processes are guided by attention, such that evidence for attended items is accumulated faster than for non-attended items. Furthermore, attention may be adaptively allocated to different items in order to optimize a tradeoff between decision quality and the computational cost of evidence accumulation. In this project, we ask whether similar forces are at play in the context of memory recall.

Such a model predicts that, when multiple memories are relevant, people will focus their efforts on recalling the target which is more strongly represented in memory, because it can be recalled with less effort. Here we present a simple form of such a model, and test this key prediction in a cued-recall experiment in which participants can select which of two possible targets to remember. We find support for a model in which memory search is guided by partial recall progress in order to minimize the time spent recalling.

Model

We model memory recall as a process of evidence accumulation. As in the DDM or LCA, we assume that evidence is sampled at each time step and that recall occurs when the total evidence hits a threshold. To make the model solvable (by dynamic programming) we assume that the evidence for each target follows a Bernoulli distribution \[ x_t \sim \text{Bernoulli}(p), \] where \(p\) corresponds to the strength of the memory. This image shows several possible traces of evidence accumulation for a single item:

knitr::include_graphics("figs/accumulation.png")
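To make the threshold dynamics concrete, here is a minimal Python sketch of a single accumulator (parameter values are arbitrary illustrations, not fitted values). Because each sample is Bernoulli(\(p\)), the number of steps needed to accrue \(\theta\) successes is negative-binomial with mean \(\theta / p\):

```python
import random

# Illustrative parameters (not fitted values)
P = 0.4        # memory strength: probability of an evidence sample per step
THETA = 5      # evidence threshold for recall

def recall_time(p, theta, rng):
    """Number of time steps until theta units of Bernoulli(p) evidence accrue."""
    evidence, t = 0, 0
    while evidence < theta:
        t += 1
        evidence += rng.random() < p   # Bernoulli(p) evidence sample
    return t

rng = random.Random(1)
times = [recall_time(P, THETA, rng) for _ in range(20000)]
mean_t = sum(times) / len(times)   # should be close to THETA / P = 12.5
```

The simulated mean hitting time matches the analytic expectation \(E[t] = \theta / p\) used below to link \(p\) to response times.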

When multiple memories are relevant, each one has a separate accumulator. Critically, we do not assume that evidence is sampled for each item in parallel. Instead, at each time step, the agent must select one of the targets and accumulates evidence for only that target. This induces a metalevel control problem: which target should the agent focus on at each moment, given only the current state of the accumulators?

This problem can be formalized as a Markov decision process in which the states correspond to the total evidence accumulated and time spent for each item (thus, the state is 4 dimensional). Because we use a discrete accumulation process, we can solve it exactly by dynamic programming. We find that the optimal policy generally converges on the target with maximal memory strength (highest \(p\)) and only draws samples for that target until it is recalled. This is illustrated in the following plot:

knitr::include_graphics("../model/figs/simple_fixation.png")
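A minimal sketch of the dynamic program is below. Because the state includes the time spent on each item (not just the evidence), we assume here that the agent does not observe \(p\) directly and instead maintains a Beta posterior over each target's strength; the prior, threshold, cost, and reward values are illustrative stand-ins, not the parameters used in our simulations:

```python
from functools import lru_cache

# Illustrative stand-ins, not the simulation parameters:
ALPHA, BETA = 2.0, 6.0   # Beta prior over each target's strength p
THETA = 3                # evidence threshold for recall
COST = 0.01              # cost per sample (time step)
REWARD = 1.0             # utility of a successful recall
HORIZON = 30             # cap on total samples (stand-in for the time limit)

def p_success(h, n):
    """Posterior predictive probability of an evidence sample for one target."""
    return (ALPHA + h) / (ALPHA + BETA + n)

@lru_cache(maxsize=None)
def value(h1, n1, h2, n2):
    """Optimal expected value of a state: evidence h_i and samples n_i per target."""
    if h1 >= THETA or h2 >= THETA:
        return REWARD          # recalled
    if n1 + n2 >= HORIZON:
        return 0.0             # timeout
    return max(q_value(h1, n1, h2, n2, item) for item in (1, 2))

def q_value(h1, n1, h2, n2, item):
    """Expected value of sampling one target once, then acting optimally."""
    if item == 1:
        q = p_success(h1, n1)
        return -COST + q * value(h1 + 1, n1 + 1, h2, n2) \
                     + (1 - q) * value(h1, n1 + 1, h2, n2)
    q = p_success(h2, n2)
    return -COST + q * value(h1, n1, h2 + 1, n2 + 1) \
                 + (1 - q) * value(h1, n1, h2, n2 + 1)

def policy(h1, n1, h2, n2):
    """Which target to sample: the one with the higher Q-value."""
    return max((1, 2), key=lambda item: q_value(h1, n1, h2, n2, item))
```

As in the plot above, the resulting policy favors the target with more accumulated evidence per unit time, since it offers a cheaper route to threshold.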

Alternative models NEW

In order to discern which model predictions are specific to a rational meta-memory model, we show predictions of two alternative models. The Random model randomly samples fixation durations from the empirical distribution. We also developed a more sophisticated Random Commitment model to account for the empirical fact that last fixations are considerably longer than all other fixations. This model samples the total number of fixations on each trial from the empirical distribution and then samples the duration of each from the empirical distribution of non-final fixations. If a target has not been recalled when the final fixation begins, the model continues to fixate the current cue until the target is recalled or the 15 second time limit is reached.
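A rough sketch of the Random Commitment generative process, with stand-in values for the empirical distributions (which the actual model samples from data) and an assumed alternation between cues at fixation boundaries:

```python
import random

# Stand-ins for quantities the real model takes from data:
N_FIX_DIST = (2, 3, 4, 5)     # empirical distribution of fixation counts
DUR_DIST = (3, 4, 5, 6)       # empirical non-final fixation durations (steps)
LIMIT = 60                    # stand-in for the 15-second limit, in steps

def random_commitment_trial(p=(0.4, 0.25), theta=5, rng=random):
    """One trial: a pre-committed fixation schedule, with an open-ended final
    fixation. Evidence accrues only for the fixated cue. We assume the model
    alternates cues at fixation boundaries (an illustrative choice)."""
    n_fix = rng.choice(N_FIX_DIST)
    evidence = [0, 0]
    item = rng.randrange(2)   # first fixation target
    t = 0
    for k in range(n_fix):
        # the final fixation lasts until recall or timeout
        dur = rng.choice(DUR_DIST) if k < n_fix - 1 else LIMIT
        for _ in range(dur):
            if t >= LIMIT:
                return None, item          # timeout
            t += 1
            evidence[item] += rng.random() < p[item]
            if evidence[item] >= theta:
                return item, item          # recalled; trial ends
        item = 1 - item                    # switch cue at fixation boundary
    return None, 1 - item
```

The function returns the recalled item (or None on timeout) and the last-fixated item; by construction, a recalled item is always the one fixated last.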

Experiment

To test the model’s predictions, we developed a modified cued-recall experiment in which participants were presented with two cues (images) on each trial and could recall the target (word) associated with either one. To create an observable behavioral correlate of targeted memory search, only one cue is visible at a time and participants use the keyboard to display each in turn. This is basically a cheap alternative to eye-tracking. The assumption is that people will look at the image they are currently trying to remember the word for. See a demo here.

knitr::include_graphics("figs/task.png")

Measuring memory strength NEW

The model predicts that people will spend more time looking at the cue for which the memory of the corresponding target is stronger, that is, the cue for which \(p\) is higher. Unfortunately, we cannot measure \(p\) directly. However, we can collect a noisy signal of this parameter using an auxiliary task. Concretely, we use reaction time in a 2-AFC task in which participants are presented with a word and must select the matching image.

To map this measure onto the model’s \(p\) parameter, we take advantage of the fact that the expected time to reach threshold in the model is \(E[t] = \theta / p\), which implies that \(\log E[t] = \log(\theta) - \log(p)\). This suggests that, in broad strokes, the \(p\) parameter should be log-linearly related to response times. The exact nature of this relationship is unclear, however, given that the 2-AFC task is quite different from a cued-recall task. Thus, we simply normalize (Z-score) the \(\log(p)\) parameter and the 2AFC-RT measure (the latter within subject) to put them on roughly the same scale. Finally, to account for the fact that the 2AFC measure is very noisy, we corrupt \(\log(p)\) with \(\sigma=3\) Gaussian noise.

In the simulations, we sample the parameter \(p\) from a \(\text{Beta}(2, 6)\) distribution, which was chosen because it makes some of the plots look better. This corresponds to an assumption about the range of accumulation rates (memory strengths) that are likely.
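Putting the last two paragraphs together, a sketch of the simulated strength measure (the constants are as stated above; the function name is ours):

```python
import math
import random

def simulated_strength_measure(n_items, sigma=3.0, rng=random):
    """Sample strengths p ~ Beta(2, 6), then produce the noisy z-scored
    log-strength signal that stands in for the 2AFC RT measure."""
    ps = [rng.betavariate(2.0, 6.0) for _ in range(n_items)]
    log_p = [math.log(p) for p in ps]
    mu = sum(log_p) / len(log_p)
    sd = (sum((x - mu) ** 2 for x in log_p) / len(log_p)) ** 0.5
    z = [(x - mu) / sd for x in log_p]               # z-scored log(p)
    signal = [x + rng.gauss(0.0, sigma) for x in z]  # sigma = 3 measurement noise
    return ps, signal
```

With \(\sigma=3\) noise on a unit-variance signal, the measure correlates only weakly with the true strengths, which is the intended level of measurement noise.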

Timeline NEW

  • Exposure: each of 40 pairs shown once
  • Test: cued-recall image to word, no feedback, each image shown once
  • Exposure
  • Distractor: 30 seconds of simple arithmetic
  • Test: each image shown twice. Performance used as memory strength index
  • Critical: double-cue recall, each pair shown once (two per trial, so twenty trials)

Main Results

  • Dropping 3 participants with less than 50% accuracy in critical trials
  • Dropping error trials
  • Z-scoring fixation durations
  • 96 participants and 1630 trials in the analysis

Fixation time course

The key prediction of a dynamic optimal sampling model such as the one we propose is that a cue with stronger memory strength should receive greater attention, because such a cue can be remembered more quickly and thus presents a lower-cost route to recall. Furthermore, this focusing effect should be stronger later in the trial, when the agent has had sufficient time to determine which cue is stronger. This is illustrated in the following plot, which shows the probability of fixating the stronger cue as a function of progress in the trial (0 is stimulus onset, 1 is response onset) and the size of the difference in strength.

normalized_timestep = function(long) {
    long %>% 
        group_by(trial) %>%
        # proportion of the trial occupied by each fixation
        mutate(prop_duration = duration / sum(duration)) %>% 
        ungroup() %>% 
        # expand each fixation into ~100 steps per trial
        mutate(n_step = round(prop_duration * 100)) %>% 
        uncount(n_step) %>% 
        group_by(trial) %>% 
        # index each step within its trial (1..~100)
        mutate(normalized_timestep = row_number())
}

long %>% 
    normalized_timestep %>% 
    drop_na(strength_diff) %>% 
    ggplot(aes(normalized_timestep/100, fix_stronger, group = strength_diff, color=strength_diff)) +
    geom_smooth(se=F) + 
    ylim(0, 1) +
    facet_grid(~name) +
    labs(x="Normalized Time", y="Probability Fixate\nStronger Cue", color="Strength Difference") +
    geom_hline(yintercept=0.5) +
    theme(legend.position="top")

In both the optimal model simulations and the human data, the probability of fixating the stronger cue steadily increases over the course of the trial, and this increase is more pronounced when there is a larger difference in strength.

In the random model, we only see a slight spike at the end of the trial. This spike is due to the fact that the model always remembers the item it looks at last, and it is also more likely to remember the item with a stronger memory.

The random commitment model does not show this effect because the item it fixates last (and thus remembers) is usually determined by the random commitment.

Overall fixation proportion

Consistent with a tendency to look longer at stronger-memory items, we see that the total proportion of time spent fixating an item increases with its memory strength relative to the other item.

df %>% 
    filter(n_pres >= 2) %>% 
    regress(rel_strength, prop_first)
N = 1047
Fixed Effects
                 Est.   S.E.  t val.    d.f.      p
(Intercept)     0.506  0.009  55.239  76.425  0.000
rel_strength    0.076  0.005  14.056  74.021  0.000
p values calculated using Satterthwaite d.f.

Last-fixation effect

However, the overall proportion effect is qualified by an interaction with the target of the last fixation. People overwhelmingly tend to select the last-viewed cue. In value-based decision making, it has been suggested that this “last-fixation effect” could explain away apparent evidence for adaptive attention allocation. Briefly, the last fixation is correlated with both fixation proportion and relative strength, and this could entirely drive the correlation between strength and fixation proportion.

df %>% 
    filter(n_pres >= 2) %>% 
    regress_interaction(rel_strength, last_pres, prop_first)
Fixed Effects
                                Est.   S.E.   t val.     d.f.      p
(Intercept)                    0.722  0.009   77.677   65.982  0.000
rel_strength                   0.003  0.004    0.638  487.376  0.524
last_pressecond               -0.394  0.014  -28.843   58.076  0.000
rel_strength:last_pressecond   0.001  0.006    0.099  130.282  0.921
p values calculated using Satterthwaite d.f.

It is worth noting, however, that we could not reproduce an effect of comparable strength in the random model, despite some effort (though not yet an exhaustive parameter search, which we plan to do).

Why is it that the optimal model and human data both show the total proportion effect but the random model seemingly cannot? We think it is due to a combination of the following two effects:

Last fixations are longer

long %>% 
    ggplot(aes(last_fix==1, duration)) + 
    stat_summary(fun.data=mean_cl_boot, geom="bar", fill="white", color="black") +
    stat_summary(fun.data=mean_cl_boot, geom="errorbar", width=0.2) +
    facet_grid(~name) + 
    scale_x_discrete(name="Fixation Type", labels=c("Non-final", "Final")) +
    ylab("Duration")

last_diff = long %>% 
    filter(name == "Human") %>% 
    with(tapply(duration, last_fix, mean)) %>% 
    diff

long %>% 
    filter(name == "Human") %>% 
    mutate(is_last = int(last_fix==1)) %>% 
    lmer(duration ~ is_last + (is_last|wid), data=.) %>% 
    summ
Fixed Effects
                 Est.    S.E.  t val.    d.f.      p
(Intercept)   795.197  29.823  26.664  71.299  0.000
is_last       841.772  54.316  15.498  92.057  0.000
p values calculated using Satterthwaite d.f.

The reason we see this in the optimal model is that the trial doesn’t stop when the model decides which cue has a stronger memory (as it does in value-based and perceptual tasks). Instead, the model must continue fixating that cue until it remembers it. By this logic, the long last fixation is evidence that at some point people commit to remembering one of the cues. This is what initially motivated the random commitment model.

Side note: this is a cool result because we see the opposite (shorter final fixations) in value-based and perceptual decisions. There, the shorter last fixation is typically explained as the result of crossing a threshold, which cuts off the final fixation.

Strength predicts last fixation

df %>% 
    mutate(last_pres_first = as.numeric(last_pres == "first")) %>% 
    regress(rel_strength, last_pres_first)
N = 1630
Fixed Effects
                Est.   S.E.  t val.    d.f.      p
(Intercept)    0.619  0.023  27.469  96.113  0.000
rel_strength   0.173  0.009  19.456  94.485  0.000
p values calculated using Satterthwaite d.f.

This effect occurs mechanistically (i.e., in the random model) because (1) the currently fixated cue is almost always the one remembered, (2) remembering the cue terminates the trial, making the current fixation the final one, and (3) the stronger cue is more likely to be remembered.
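The three steps above can be sketched in a quick simulation of the random model (illustrative parameters; not the fitted empirical distributions):

```python
import random

def random_model_trial(p_strong=0.4, p_weak=0.2, theta=5, mean_fix=4,
                       limit=300, rng=random):
    """One trial of the Random model: fixate a randomly chosen cue for a
    random duration; evidence accrues only for the fixated cue; hitting
    threshold (recall) ends the trial."""
    p = (p_strong, p_weak)
    evidence = [0, 0]
    t = 0
    while t < limit:
        item = rng.randrange(2)                       # (1) random fixation target
        dur = 1 + int(rng.expovariate(1.0 / mean_fix))
        for _ in range(dur):
            t += 1
            if t > limit:
                return None                           # timeout
            evidence[item] += rng.random() < p[item]
            if evidence[item] >= theta:
                return item   # (2) recall terminates the trial mid-fixation
    return None

rng = random.Random(2)
recalled = [r for r in (random_model_trial(rng=rng) for _ in range(4000))
            if r is not None]
share_stronger_last = recalled.count(0) / len(recalled)
```

Because the stronger cue (item 0 here) reaches threshold in fewer of its own fixated steps, well over half of trials end while fixating it, even though fixations are targeted at random.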

However, this effect doesn’t occur in the random commitment model because the cue it fixates last is usually determined by the random commitment, not by crossing the threshold (only the latter being more likely for stronger cues). Comparing these two effects, we see that the plain random model captures the second (strength predicts the last fixation) but not the first (longer final fixations), while the random commitment model captures the first but not the second. Only the optimal model and humans show both effects.

Individual fixation durations

Although the random models we designed did not demonstrate the smoothly increasing probability of fixating the stronger cue, nor the overall proportion effect, it is possible that some other confound related to the last fixation explains these effects. For this reason, it is informative to look at individual non-final fixation durations.

Because we are excluding final fixations, different participants are better represented in some parts of the x axis than others. This produces Simpson’s-paradox-like effects. Furthermore, because many participants have few trials with more than two or three fixations, it is difficult to account for this with random-effects models (few data points per individual). For this reason, we plot and analyze fixation durations that have first been normalized (z-scored) by the mean and SD duration for that participant across all non-final fixations (there are no substantial differences in duration by fixation number). We still employ mixed-effects models to account for individual differences in sensitivity to cue strength.
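The within-participant normalization is simple enough to sketch directly (a minimal stand-alone version; the actual analysis does this in the R pipeline):

```python
from collections import defaultdict
from statistics import mean, stdev

def zscore_within(values, subject_ids):
    """Z-score each value using the mean and SD of its own subject's values,
    mirroring the within-participant normalization of fixation durations."""
    groups = defaultdict(list)
    for s, v in zip(subject_ids, values):
        groups[s].append(v)
    stats = {s: (mean(vs), stdev(vs)) for s, vs in groups.items()}
    return [(v - stats[s][0]) / stats[s][1]
            for s, v in zip(subject_ids, values)]
```

After normalization, every participant's non-final fixation durations have mean 0 and SD 1, so between-participant baseline differences cannot masquerade as strength effects.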

First fixation

df %>% 
    filter(n_pres >= 2) %>% 
    regress(strength_first, first_pres_time, mixed=MIXED_DURATIONS)
N = 1047
Fixed Effects
                  Est.   S.E.  t val.    d.f.      p
(Intercept)     -0.042  0.030  -1.408  83.208  0.163
strength_first   0.036  0.022   1.626  60.471  0.109
p values calculated using Satterthwaite d.f.

Second fixation

df %>% 
    filter(n_pres >= 3) %>% 
    regress(rel_strength, second_pres_time, mixed=MIXED_DURATIONS)
N = 505
Fixed Effects
                Est.   S.E.  t val.     d.f.      p
(Intercept)    0.102  0.067   1.521   49.299  0.135
rel_strength  -0.111  0.037  -2.987  450.436  0.003
p values calculated using Satterthwaite d.f.

Third fixation

df %>% 
    filter(n_pres >= 4) %>% 
    regress(rel_strength, third_pres_time, mixed=MIXED_DURATIONS)
N = 102
Fixed Effects
                Est.   S.E.  t val.    d.f.      p
(Intercept)    0.380  0.196   1.937  27.413  0.063
rel_strength   0.375  0.109   3.432  27.458  0.002
p values calculated using Satterthwaite d.f.

The second and third fixations show strong, highly significant effects; the first-fixation effect is much weaker and only trending toward significance. However, for some parameter settings (not the plotted ones), the optimal model predicts a weak first-fixation effect. I haven’t dug into why or when this occurs, but if the first-fixation effect is genuinely absent in the data, we can likely explain its absence.

Sanity checks for evidence accumulation

These plots test basic predictions of the evidence accumulation model, that is, effects that emerge even in the random model.

These plots only include correct trials.

Probability of remembering first word

df %>%
    filter(response_type == "correct") %>% 
    mutate(choose_first = int(choose_first)) %>% 
    regress(rel_strength, choose_first) +
    ylab("Prob Select First Cue") + ylim(0, 1)
N = 1606
Fixed Effects
                Est.   S.E.  t val.    d.f.      p
(Intercept)    0.635  0.021  30.758  96.192  0.000
rel_strength   0.181  0.008  21.595  95.294  0.000
p values calculated using Satterthwaite d.f.

Reaction time by chosen strength

df %>% 
    filter(response_type == "correct") %>% 
    regress(chosen_strength, rt)
N = 1606
Fixed Effects
                      Est.     S.E.  t val.     d.f.      p
(Intercept)       3225.829  102.421  31.496  101.613  0.000
chosen_strength   -547.433   91.857  -5.960   86.179  0.000
p values calculated using Satterthwaite d.f.

Last fixation duration by strength

df %>% 
    filter(response_type == "correct") %>% 
    filter(n_pres > 0) %>% 
    mutate(
        last_pres_time = map_dbl(presentation_times, last),
        last_rel_strength = if_else(last_pres == "first", rel_strength, 1 - rel_strength),
        last_strength = if_else(last_pres == "first", strength_first, strength_second)
    ) %>% 
    regress(last_strength, last_pres_time)
N = 1606
Fixed Effects
                   Est.    S.E.  t val.     d.f.      p
(Intercept)    1714.533  70.473  24.329  102.386  0.000
last_strength  -265.268  67.723  -3.917  101.124  0.000
p values calculated using Satterthwaite d.f.

Miscellaneous results

Response types

Note: The model can only predict correct answers and timeouts.

raw_df %>% 
    with((proportions(table(name, response_type), margin=1))) %>% 
    kable(digits=2)
                   correct  intrusion  other  timeout  empty
Optimal               0.96       0.00   0.00     0.04   0.00
Human                 0.88       0.06   0.03     0.01   0.02
Random                0.89       0.00   0.00     0.11   0.00
Random Commitment     0.88       0.00   0.00     0.12   0.00

Reaction time

df %>% 
    ggplot(aes(rt)) +
    geom_density() +
    facet_grid(~name) +
    labs(x="Reaction Time", y="Density")

Number of fixations

df %>% 
    filter(n_pres < 10) %>% 
    ggplot(aes(n_pres, ..prop..)) +
    geom_bar() +
    facet_grid(~name) +
    labs(x="Number of Fixations", y="Proportion of Trials")